(54) A method of producing a checkpoint which describes a base file and a method of generating
a difference file defining differences between an updated file and a base file 



(57) A checkpoint which describes a base file is produced by
dividing the base file into a series of segments; generating for
each segment a segment description; and creating from the
generated segment descriptions a segment description structure as
the checkpoint. The segment descriptions represent segments of the
base file at a minimum level of resolution sufficient to represent
distinctly the segment. A difference file which defines
differences between an updated file and the base file is produced
by generating at different levels of resolution segment
descriptions for segments in the updated file and comparing the
generated segment descriptions with segment descriptions in the
checkpoint to identify matching and non-matching segments. Data
identifying segments in the updated file that match segments in
the base file and data representing portions of the updated file
at a minimum level of resolution sufficient to represent
distinctly the portion are stored as the difference file.
Description

Technical Field 

[0001] The invention relates to a method of producing a checkpoint
which describes a base file and a method of generating a
difference file defining differences between an updated file and a
base file. The invention can be applied for example to network
systems where a remote copy of a file is kept up-to-date by the
transmission and application of the differences between the
successive versions of the local copy, thereby using bandwidth
more efficiently. This includes modern on-line backup and data
replication systems, and network computer systems that enable
applications to transmit only the changes to memory-loaded files
from client to server on successive save operations. The invention
can also be applied for example to backup subsystems, where
storing only a difference to files can make more economical use of
storage media.

Background Of The Invention



[0002] Methods that determine how to transform one file into
another have long been of interest to computer scientists. Today,
many such methods exist. Capital is made from the fact that
generated descriptions of a transformation can usually be made
smaller than the would-be transformed file. In the main,
therefore, these techniques are applied to files that are
successively modified. Both a base and an updated version of a
file are taken, and a description of how to transform the base
file into the updated version is generated. Such descriptions of
incremental transformation are used for things like reducing the
expense of storing file histories and for keeping remote copies of
changing files up-to-date.
[0003] Source code control systems provide some of the earliest
examples of such difference or transformation calculation
techniques in practice. These systems are used in software
projects to keep version histories of textual source code files,
which are likely to be modified many times over their lifetime. As
storage space is at a premium, it is prohibitively expensive to
store the large number of successive versions of each file whole.
Instead, the typical solution is to store the first version of a
file and thereafter record only the line by line difference
between following versions. When a programmer makes a request for
a particular version of a file, the system takes the earliest
version of the file, which is stored whole, and sequentially
applies the successive differences between the versions until the
earliest version has been transformed into the requested version.
An early description of such a system can be found in a technical
paper by M. J. Rochkind, titled "The Source Code Control System",
IEEE Transactions on Software Engineering, Vol. SE-1, No. 4, Dec.
1975, pp. 364-370.

[0004] Rochkind's system describes differences by the line of
text, but more modern techniques describe differences at the level
of individual bytes. These techniques have found important
application on networks where transmission of data is expensive.
As a way of saving bandwidth, particularly over modem lines and
the Internet, updates to files are often distributed as
descriptions of byte level differences, or binary patches, from
previous versions. Such a technique is widely used in the
distribution of updates to software packages. Here vendors often
want to update executable files installed on users' computers
because a security flaw or some other problem has been discovered.
Rather than asking them to download updated versions of the
affected files whole, binary patches representing a minimal
description of how the old file versions need to be modified are
generated. The binary patches are then made available for
downloading and users can quickly obtain and apply them to
transform the problem files into the revised versions.
[0005] Despite the widespread use of the aforementioned
traditional patching techniques, however, they have proved
inadequate for some new types of network application. Problems
have arisen with the need to have both the base and updated
versions of files to hand to calculate differences. The new
applications often need to transfer only the difference between
successive versions of files to economize on bandwidth, but cannot
afford the expense associated with storing local copies of both
the base and updated versions of every file. An example of such a
situation occurs in the newly emerging field of on-line backup
systems. Here backup servers store copies of large numbers of
clients' files, and these typically have to be kept up-to-date
using a slow connection available for data transfer. Some
backed-up files, such as mailboxes, may be tens of megabytes in
size yet change regularly by only a few kilobytes on each
modification. In such cases, it is only practical to transmit the
difference between the last stored copy of the file and its latest
version on each backup. But implementing this scheme utilizing
traditional techniques necessitates clients keeping local copies
of the last transmitted versions of backed up files. This means
that the space consumed by backed up files is effectively doubled.
[0006] The problems arising from applying traditional patching
techniques to on-line backup systems can be witnessed in those
that use them. Such a system is described in U.S. Pat. No.
5,634,052 issued on May 27, 1997 to Robert J. T. Morris and
assigned to International Business Machines Corporation. Hot Wire
Data Security, Inc has implemented a similar system called
BackupNet (www.backupnet.com). In these systems the client
actually keeps copies of the last versions of files that have been
transferred to the server in a cache. On the next backup, these
are used to generate patches for modified files that need to be
updated on the server. When the technique finds a match in the
cache it can generate minimal size patches because it has both
base and updated file versions to hand. But unfortunately storage
restrictions on typical machines constrain caches to holding only
a fraction of the files assigned to the backup system, especially
where large files are involved. Therefore even if files can be
entered and deleted from the cache on an accurate
most-likely-to-be-modified basis, numerous cases always occur
where an entire updated file, rather than just a patch, has to be
transmitted.

[0007] A new class of patching technique has evolved to reduce
dramatically the number of the aforementioned cache misses. In
techniques of the new class, special difference checkpoint data is
derived from the base file that can later be substituted for it
during patch generation. Checkpoints are designed to consume only
a tiny fraction of their corresponding base file's storage, but
still contain sufficient information to allow a binary patch to be
calculated with good efficiency. A basic tradeoff often exists,
where the smaller checkpoints are, and the less information they
hold, the more inaccurate the difference calculation and the
larger the size of the generated patch. But the tradeoff can be
balanced according to the situation and so better solutions can
usually be achieved than with traditional methods. A description
of a checkpoint-based patching technique can be found in U.S. Pat.
No. 5,479,654 issued on Dec. 26, 1995 to Squibb and assigned to
Squibb Data Systems, Inc. An example of such a technique in
practice can be found in Connected Corporation's Delta Blocking
technology, as used in their Connected On-line Backup system
(www.connected.com).

[0008] Difference checkpoints can be constructed in many ways, but
at the time of writing all are based upon digital signatures.
Represented files are divided into equal sequential segments, and
a digital signature is calculated for each and stored in the
checkpoint. The signatures require only a very small amount of
space to store, but perform a fingerprinting function that allows
the bytes in a segment to be uniquely identified beyond a
reasonable doubt. One popular signature that has been standardized
by the CCITT is the 32 bit CRC, a discussion of which can be found
in a technical article by Mark Nelson titled "File Verification
Using CRC", Dr. Dobb's Journal, May 1992. Each 32 bit CRC consumes
four bytes of storage, so if a segment size of one kilobyte is
chosen checkpoints can be constructed that consume less than one
percent of their corresponding file's size. Even so, by searching
a file for segment lengths of bytes with signatures matching those
stored in the checkpoint, blocks of bytes can be identified that
are present in the represented file. The tradeoff can be seen to
be that the smaller the segment length chosen, the more accurately
the difference can usually be calculated, but the more signatures
generated and the more space needed to store the checkpoint. In
practice though, using a standard segment length of 512 bytes
where medium to large files are involved results in patches being
calculated that are only one or two percent larger than those
calculated with traditional techniques.
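
The storage arithmetic above can be illustrated with a minimal sketch. It assumes Python's zlib.crc32 as the 32 bit CRC and a hypothetical helper named build_checkpoint; neither the function name nor the sample data comes from the specification.

```python
import zlib

def build_checkpoint(data: bytes, segment_size: int = 1024) -> list[int]:
    """Return one 32-bit CRC per fixed-size sequential segment of `data`."""
    return [
        zlib.crc32(data[offset:offset + segment_size])
        for offset in range(0, len(data), segment_size)
    ]

# Illustrative sample data only: 102,400 bytes, i.e. 100 one-kilobyte segments.
base = bytes(range(256)) * 400
checkpoint = build_checkpoint(base)
checkpoint_bytes = 4 * len(checkpoint)   # each 32-bit CRC occupies four bytes
ratio = checkpoint_bytes / len(base)     # fraction of the base file's size
```

With one-kilobyte segments the ratio is 4/1024, or roughly 0.4 percent; halving the segment length doubles the number of signatures and hence the checkpoint size.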



[0009] However, while checkpoint stored signatures provide a means
to match segments in an updated file with segments in a base file,
they cannot provide a satisfactory solution on their own. Segments
of bytes in an updated file that have signatures matching those of
sequential base file segments may occur at any offset and in any
order. Therefore without any supplementary method, only a
prohibitively expensive route for finding every identifiable
segment is available. This must involve calculating the signature
of a segment's length of bytes following every offset in the
updated file, and checking whether it matches a signature in the
checkpoint. It is quite reasonable to increment the offset in the
updated file by a segment's length when a matching segment is
found, so when the base and updated files are identical only as
many signatures will be calculated as there are sequential
segments. But in the worst case where the files share no reused
segments, almost as many signatures will be calculated as there
are bytes in the updated file. As signature calculations involve
passing every byte in the respective segment through a complex
function, it is clear that the computational complexity of the
worst case is far too great.
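
The best and worst cases described in this paragraph can be made concrete with a short sketch. The function name exhaustive_match and the use of zlib.crc32 as the signature are assumptions made for illustration; the paragraph does not prescribe an implementation.

```python
import zlib

def exhaustive_match(updated: bytes, signatures: dict[int, int], segment_size: int):
    """Scan `updated` at every offset with no supplementary filtering.

    `signatures` maps a base-segment CRC to that segment's index.
    Returns (matches, number_of_signatures_calculated).
    """
    matches = []        # (offset in updated file, base segment index)
    calculated = 0
    offset = 0
    while offset + segment_size <= len(updated):
        crc = zlib.crc32(updated[offset:offset + segment_size])
        calculated += 1
        if crc in signatures:
            matches.append((offset, signatures[crc]))
            offset += segment_size   # jump past the matched segment
        else:
            offset += 1              # otherwise try the very next byte offset
    return matches, calculated
```

For a 512-byte file and 64-byte segments, an identical file costs 8 signature calculations, whereas a file sharing no segments with the base costs 449, one for every offset at which a full segment still fits.
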
[0010] To reduce the aforementioned computational complexity, some
techniques simply avoid trying to identify every reused segment
possible. In its simplest form, this involves assuming that if the
updated file contains segments from the base file, then they will
be present at the offset at which they were originally sequenced.
Signatures are calculated for sequential segments in the updated
file and then compared directly with the checkpoint-stored
signature of the equivalent sequential segment in the base file.
This ensures that only as many signatures as there are sequential
segments in the updated file are calculated. As a consequence of
this approach though, these techniques fall down even in the
simple case where a file is modified by the insertion of data. In
such a case where a base file has a single byte prefixed to the
beginning, thereby altering all of the segment alignments, no
matches will be found and a patch is calculated that is the same
size as the updated file. Because of this methodology's inability
to deal with the majority of file modifications, it is generally
considered inadequate. Instead, techniques have centered upon
checking for matches at each possible offset, by finding ways of
discounting non-matching segments before having to calculate their
signature.
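
The single-byte-prefix failure can be reproduced with a few lines. This is only a sketch of the aligned comparison described above; aligned_matches is a hypothetical name, the sample data is arbitrary, and zlib.crc32 stands in for whatever signature a real system would use.

```python
import zlib

def aligned_matches(base: bytes, updated: bytes, segment_size: int) -> int:
    """Count segments of `updated` whose CRC equals that of the base
    segment at the same sequential position."""
    def crcs(data: bytes) -> list[int]:
        return [
            zlib.crc32(data[i:i + segment_size])
            for i in range(0, len(data), segment_size)
        ]
    return sum(a == b for a, b in zip(crcs(base), crcs(updated)))

# Arbitrary, non-repeating sample data: 4096 bytes, i.e. 8 segments of 512.
base = bytes((i * 31 + i // 7) % 256 for i in range(4096))
same = aligned_matches(base, base, 512)                 # unmodified file
shifted = aligned_matches(base, b"\x00" + base, 512)    # one byte prefixed
```

Every segment matches for the unmodified file, but prefixing one byte shifts all alignments and leaves no matching segment, so the whole updated file would have to be sent.
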
[0011] The preferred method of improving the efficiency of patch
generation is to supplement checkpoints with data extraneous to
the fingerprint matching process. Such data is included purely for
the improvement of efficiency and it is not responsible for the
final identification of reused segments. Squibb's technique
manifests such an approach and places three different but
increasingly expensive types of signature in the checkpoint, only
the most expensive of which is used to irrefutably identify
segments. The signatures consist of an XOR of a subset of bytes
from the segment, a 16 bit CRC of all the bytes in the segment,
and finally a 32 bit CRC of all the bytes in the segment. At each
offset in the provided file where he believes a segment from the
represented file may be found, he first calculates the relatively
inexpensive XOR. Only if a match is found does he proceed to
calculate the more expensive 16 bit CRC, and if that matches, the
still more expensive 32 bit CRC. The XOR test quickly discounts
most segments that have big differences. The 16 bit CRC that is
calculated next discounts most segments that don't have big
similarities. Hence the most expensive signature, the 32 bit CRC,
is only calculated in cases where a strong probability exists of a
match being found, delivering a big increase in general
efficiency.
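
The cheapest-first cascade described above might be sketched as follows. This is not Squibb's implementation: the choice of every sixteenth byte for the XOR, binascii.crc_hqx as the 16 bit CRC and zlib.crc32 as the 32 bit CRC are all assumptions, as are the function names.

```python
import binascii
import zlib
from functools import reduce

def xor_subset(segment: bytes, stride: int = 16) -> int:
    """Cheapest test: XOR of every `stride`-th byte (assumed subset)."""
    return reduce(lambda acc, b: acc ^ b, segment[::stride], 0)

def describe(segment: bytes) -> tuple[int, int, int]:
    """The three signatures stored in the checkpoint for one segment."""
    return xor_subset(segment), binascii.crc_hqx(segment, 0), zlib.crc32(segment)

def tiered_equal(candidate: bytes, stored: tuple[int, int, int], counts: dict) -> bool:
    """Compare signatures cheapest first; compute a dearer one only
    when every cheaper one has already matched."""
    counts["xor"] += 1
    if xor_subset(candidate) != stored[0]:
        return False
    counts["crc16"] += 1
    if binascii.crc_hqx(candidate, 0) != stored[1]:
        return False
    counts["crc32"] += 1
    return zlib.crc32(candidate) == stored[2]
```

The counts dictionary exists only to show how often each tier is reached; a candidate rejected by the XOR never incurs either CRC.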

[0012] However, techniques, such as Squibb's, that construct
efficiency enhancing data in the checkpoint from some fixed range
of relatively inexpensive signatures still suffer a number of
deficiencies. One deficiency is that such techniques cannot adapt
their derivation of efficiency data according to different file
types or particular patterns within files. Files containing long
stretches of the same byte, those containing regular patterns of
bytes and those comprising a small subset of bytes cause
inordinately frequent matching of the less expensive signatures
where the segments differ, thereby causing large numbers of
unnecessary calculations of the most expensive signature. Another
deficiency is that the user cannot stipulate the amount of
efficiency enhancing data to be derived for a file, say to reflect
the likelihood of it being modified and therefore requiring
updating in an on-line backup system. A further deficiency is that
given some arbitrary limit upon the amount of efficiency data that
may be derived, maximum performance is not achieved. The present
invention addresses these deficiencies by utilizing a
multi-dimensional hierarchical representation of efficiency data
that is derived at variable rates of "resolution".

Summary Of The Invention

[0013] In one aspect the invention provides a method of producing
a checkpoint which describes a base file, the method comprising:
dividing the base file into a series of segments; generating for
each segment a segment description; and creating from the
generated segment descriptions a segments description structure as
the checkpoint.

[0014] In another aspect the invention provides a method of
producing a morph list that defines an updated version of a base
file with reference to the base file and a checkpoint for the base
file which checkpoint is produced according to any preceding
claim, the method comprising: defining a first segment at a start
position in the updated file; generating a segment description for
the first segment; comparing the segment description for the first
segment with segment descriptions of the checkpoint; and if a
match is found, adding the matched segment description to the
morph list and, if no match is found, adding data in the first
segment to the morph list.
[0015] The invention also provides a method of generating a
difference file defining differences between an updated file and a
base file, the method comprising: generating a checkpoint defining
characteristics of the base file in terms of multiple segment
descriptions each selected to represent a respective segment of
the base file at a minimum level of resolution sufficient to
represent distinctly the segment; generating at different levels
of resolution segment descriptions for segments in the updated
file and comparing the generated segment descriptions with segment
descriptions in the checkpoint to identify matching and
non-matching segments; and storing as the difference file data
identifying segments in the updated file that match segments in
the base file and data representing portions of the updated file
at a minimum level of resolution sufficient to represent
distinctly the portion.

[0016] As will become clear from the description that follows, the
invention offers several advantages over hitherto known
approaches. The invention enables checkpoints to be composed from
signatures that identify segments and data that enhances
difference generation efficiency, and thus to derive adaptively
the efficiency enhancing checkpoint data according to the base
file type to achieve better performance. The invention enables the
efficiency enhancing data contained in the checkpoint to be
hierarchically derived and stored so as to minimize the required
storage size. The invention enables representations of differences
to be generated as efficiently as possible, given any arbitrary
limit on checkpoint size. The invention can be applied to networks
and can reduce network transmission cost in a variety of network
applications. The present invention also enables the storage
requirement in the backup subsystem of a client-server system to
be reduced.
[0017] Briefly stated, special checkpoint data is derived from a
base file. The checkpoint contains signatures taken from, and
uniquely identifying, the sequential segments of the base file.
The checkpoint also contains efficiency data, designed to make the
following process more efficient. A modified version of the base
file (also referred to as the new version of the base file or
changed version of the base file or updated version of the base
file) is presented. A description of the difference between the
base file and the updated file is generated that describes the
updated file in terms of new bytes and segments that are also
present in the base file.

[0018] Checkpoint efficiency data (also referred to as image data)
is derived (also referred to as sampled) to hold varying amounts
of information about associated base file segments. The amount of
data held (also referred to as the resolution) is increased or
decreased during checkpoint derivation in an attempt to elicit
distinguishing detail from the base file segments represented. The
image data is hierarchically derived and stored



7 EP 0 981 090 A1 



in such a way that it occupies a similar amount of space as though
it had been sampled at the lowest resolution throughout. During
generation of the difference representation, the image data is
used to determine whether or not to make expensive signature
calculations. Because more information is contained within the
hierarchical representation of the image data, the method is able
to calculate whether to make signature calculations with a greater
degree of accuracy, thus improving general efficiency. Because
sampling resolution is increased to find distinguishing segment
detail where necessary, a degree of adaptation to different file
types is provided, thus reducing the number of file types that can
produce unusually poor performance.
[0019] The above and further features of the invention 
are set forth with particularity in the appended claims 
and together with advantages thereof will become clear- 
er from consideration of the following detailed descrip- 
tion of an exemplary embodiment of the invention given 
with reference to the accompanying drawings. 

Brief Description Of The Drawings 

[0020] In the drawings: 

Figure 1 is a flow chart showing a series of procedures embodying
the invention;
Figure 2 is a flow chart showing another, simplified series of
procedures embodying the invention;
Figure 3 is a representation of an image divided into segments;
Figure 4 is a flow chart showing a high level representation of a
routine for scanning a base file;
Figure 5 shows an example of data in a Segment Description;
Figure 6 shows an example of a multi-level incremental lossy image
bytes sampling scheme;
Figure 7 shows (a) construction of a segment description structure
in a first level of resolution, and (b) construction of the
segment description structure in two levels of resolution;
Figure 8 is a flow chart showing part of the flow chart of Figure
3 in greater detail;
Figure 9 shows an example of a segments description structure
where three levels of lossy resolution have been defined;
Figure 10 is a table showing the storage space consumed by segment
descriptions when stored in a checkpoint;
Figure 11 is a diagram showing how, given some hypothetical
segments description structure, the Segment Description Nodes
shown in Figure 10 might be ordered on disk to allow
reconstruction of the structure;
Figure 12 is a high-level flow chart describing a MATCH program;
and
Figure 13 is a flow chart showing part of the MATCH program in
greater detail.




Detailed Description 

[0021] In the following specific description a method is disclosed
of using a special checkpoint representation of an initial or base
file to calculate how to transform it into an updated or provided
file. The checkpoint contains less data than the file represented
thereby, thus making the method ideal for situations where one
file cannot be present during patch calculation due to memory or
storage restrictions. The checkpoint contains two types of data
derived from sequential blocks in the represented file. This
consists of signatures that uniquely identify each sequential
block and lossy image data that approximates their shape.

[0022] The resolution at which image data is extracted for
individual blocks is varied in order to capture their
distinguishing features. However the image data is hierarchically
represented so that it requires only marginally more storage than
if the lowest level of resolution had been used throughout. The
method involves loading the checkpoint data into a search
structure that comprises multi-dimensional hierarchies of trees,
with each tree comprising structures sorted on image data
extracted for successive levels of resolution.
[0023] The method then involves moving incrementally through the
provided data file. At each offset the method scans the search
structure for blocks described in the checkpoint whose shape
matches that of the following block of bytes in the file. If there
is no match, then a byte unique to the provided file has been
found. If there is a match, the method next involves calculating
the signature for the next block in the file.
[0024] If the signature matches the matching checkpoint
description, the method knows that it has found a block in the
provided file that exists in the represented file and continues
searching just beyond it. This process continues until a
description of the provided file has been created in terms of
unique bytes and blocks from the represented file. High efficiency
is delivered because signatures, which are expensive to calculate,
are only calculated for blocks in the provided file when matching
image data indicates a high probability of equivalence being
found. Fewer signature calculations are made than if the image
data had been sampled at a single resolution. Further, the method
extracts image data in a way that provides a degree of adaptation
to different file types, thus reducing the number of poor
performance cases where many unnecessary signature calculations
have to be made.
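
The matching loop of the last two paragraphs can be sketched in simplified form. The sketch deliberately uses a single fixed sampling resolution and a flat lookup table rather than the multi-resolution tree hierarchy of the invention; the names SEGMENT, SAMPLE_AT, sample, scan and match, the sampling positions and the use of zlib.crc32 are all assumptions made for illustration.

```python
import zlib

SEGMENT = 8               # illustrative block length
SAMPLE_AT = (0, 3, 6)     # hypothetical sampling positions within a block

def sample(block: bytes) -> bytes:
    """Lossy image of a block: the bytes at the sampled positions."""
    return bytes(block[i] for i in SAMPLE_AT)

def scan(base: bytes) -> dict[bytes, list[tuple[int, int]]]:
    """Map each block's lossy image to its (signature, index) pairs.
    A trailing partial block of the base file is ignored in this sketch."""
    table: dict[bytes, list[tuple[int, int]]] = {}
    for index in range(len(base) // SEGMENT):
        block = base[index * SEGMENT:(index + 1) * SEGMENT]
        table.setdefault(sample(block), []).append((zlib.crc32(block), index))
    return table

def match(updated: bytes, table) -> tuple[list, int]:
    """Describe `updated` as unique bytes and blocks from the base file.
    Returns (description, number_of_signatures_calculated)."""
    morph, crc_count, offset = [], 0, 0
    while offset < len(updated):
        block = updated[offset:offset + SEGMENT]
        hit = None
        # The signature is calculated only when the image data matches.
        if len(block) == SEGMENT and sample(block) in table:
            crc_count += 1
            crc = zlib.crc32(block)
            hit = next((i for c, i in table[sample(block)] if c == crc), None)
        if hit is not None:
            morph.append(("block", hit))
            offset += SEGMENT        # continue searching just beyond the block
        else:
            morph.append(("byte", updated[offset]))
            offset += 1              # a byte unique to the provided file
    return morph, crc_count
```

For a 24-byte base file with one byte prefixed, the description is one unique byte followed by references to blocks 0, 1 and 2, and only three signatures are calculated.
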

[0025] As shown in Figure 1 of the accompanying drawings, the
invention may be embodied in four separate programs that are run
sequentially to generate a description of the difference between
two successive versions of a file. The first program, hereafter
called the SCAN program 10, scans a base file to produce a
memory-loaded checkpoint. The second program, hereafter called the
SAVE program 12, is optional and writes the memory-loaded
checkpoint to non-volatile storage. The third program, hereafter
called the LOAD program 14, is also optional and is used in
conjunction with the SAVE program 12 to load a checkpoint from
non-volatile storage into memory. The fourth program, hereafter
called the MATCH program 16, scans through an updated version of
the base file in conjunction with a memory-loaded checkpoint and
generates a description of the updated base file in terms of
unique bytes and segments of bytes that can be found in the
original base file.
[0026] Generally, if the period of time between execution of the
SCAN and MATCH programs 10, 16 is large, then the SAVE and LOAD
programs 12, 14 will be included in the execution sequence, as
shown in Figure 1. If on the other hand the aforementioned period
is small, these programs 12, 14 may be excluded. As shown in
Figure 2, a program, hereafter called the COMPACT program 18, may
be run in their place to reduce the memory consumed by the loaded
checkpoint.
[0027] The data in a base file may represent anything from
financial data to news reports, computer programs to electronic
images. The concepts underlying the invention may be more easily
understood if the data is considered to represent an electronic
image. In the following therefore the operation of the various
programs will be described as if the data in the base file is of
an image. Those possessed of the appropriate skills will
appreciate that the invention is not limited to the processing of
image files and is equally applicable to any file containing
digital data regardless of what that data represents.

[0028] Figure 3 of the accompanying drawings shows an image 20
represented by data in a base file. As will be explained in the
following, the SCAN program 10 divides the data representing the
image into a series of image segments 21 to 29 of equal length.
The length of each segment need not be governed by the size of the
image. Thus, for example segments 24 and 27 both comprise an end
portion 24', 27' of one line and a start portion 24", 27" of the
next line. The SCAN program 10 is shown in more detail in Figure 4
of the accompanying drawings. The SCAN program 10 works
sequentially through a base file obtained from a user at step 30
(hereafter a "provided base file"), processing each sequential
segment in turn, as represented by the blocks 32, 34, 36, 38, 40 &
42. For each segment the SCAN program creates at step 36 a segment
description describing the segment. Next, at step 38, the SCAN
program enters the segment description into a segments description
structure, which describes the base file.
[0029] As shown in Figure 5 of the accompanying drawings a segment
description 44 is composed from a combination of a signature 45
that uniquely identifies the bytes in the segment, such as a
cyclic redundancy check (CRC), and a lossy sample 46 comprising
bytes sampled from the segment according to a sampling scheme. The
segment description 44 also comprises a segment index 47
identifying where the segment appears in the sequence. For
example, in Figure 3 segment 21 would have an index of "0",
segment 22 an index of "1", segment 23 an index of "2" and so on.
The idea is that bytes can be sampled from the segment in a series
of increments, thus creating lossy representations of the segment
across a range of resolutions. There are several ways in which
this may be done. One way is shown in Figure 6 of the accompanying
drawings. As shown in Figure 6 the SCAN program 10 initially
creates from a segment 50 a lossy image 52 by sampling at a
maximum level of resolution, chosen by the user. Increasingly
lossy images 54, 56 at successively lower levels of resolution are
constructed by the incremental removal of sampled bytes from the
level above. The segment description therefore effectively holds
images sampled throughout a range of resolutions from lossless
down to the most lossy, lowest level of resolution.
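
The three fields of the segment description of Figure 5 might be modelled as below. The class and function names, the choice of zlib.crc32 for the signature, and the sampling positions passed in by the caller are hypothetical; only the three fields themselves (signature, lossy sample, segment index) come from the description.

```python
import zlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SegmentDescription:
    signature: int        # identifies the segment's bytes (CRC-32 here)
    lossy_sample: bytes   # bytes sampled from the segment
    index: int            # position of the segment in the base file

def describe_segments(base: bytes, segment_size: int,
                      positions: tuple[int, ...]) -> list[SegmentDescription]:
    """Build one description per sequential segment of `base`."""
    descriptions = []
    for index, offset in enumerate(range(0, len(base), segment_size)):
        segment = base[offset:offset + segment_size]
        # Positions beyond a short final segment are simply skipped.
        sampled = bytes(segment[p] for p in positions if p < len(segment))
        descriptions.append(
            SegmentDescription(zlib.crc32(segment), sampled, index)
        )
    return descriptions
```

As in the example of Figure 3, the first segment receives index 0, the second index 1, and so on.
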
[0030] In Figure 6, the segment 50 is shown as having a length of
16 bytes. Three levels of resolution are also shown. This is only
to simplify the drawing. In practice a segment length of 512 bytes
would be more realistic, and 16 levels of resolution would be more
typical. The bytes contained in the segment 50 provide a lossless
representation of the image segment. At the next level down, a
lossy sample 52 is created by sampling the lossless segment 50 at
the maximum sampling resolution, in this case resolution 3
corresponding to six out of the 16 bytes of the lossless segment.
This lossy segment 52 comprises a subset of the bytes contained in
the segment 50. Bytes from the least lossy, i.e. highest
resolution sample 52 are used to create further lossy images 54,
56 at lower resolutions.
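
One possible reading of the incremental sampling scheme of Figure 6 is sketched below for a 16-byte segment. Only the figure of six sampled bytes at resolution 3 comes from the text; the sampling positions, the sizes assumed for resolutions 1 and 2, and the names SAMPLE_ORDER, LEVEL_SIZES, sample_at and increment are assumptions.

```python
# Hypothetical sampling order for a 16-byte segment. Level k keeps the first
# LEVEL_SIZES[k] positions, so each lower level is a subset of the one above.
SAMPLE_ORDER = (0, 8, 4, 12, 2, 10)
LEVEL_SIZES = {1: 2, 2: 4, 3: 6}   # sizes for levels 1 and 2 are assumed

def sample_at(segment: bytes, level: int) -> bytes:
    """Lossy image of `segment` at the given level of resolution."""
    return bytes(segment[p] for p in SAMPLE_ORDER[:LEVEL_SIZES[level]])

def increment(segment: bytes, level: int) -> bytes:
    """Only the bytes that `level` adds on top of the level below it."""
    lower = LEVEL_SIZES.get(level - 1, 0)
    return bytes(segment[p] for p in SAMPLE_ORDER[lower:LEVEL_SIZES[level]])
```

Because each level is a prefix of the next, storing the per-level increments reproduces the highest-resolution sample without duplicating any byte, and two segments that look identical at resolution 1 can still be told apart at resolution 2.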
[0031] The purpose of creating a range of image samples at
different resolutions is to create a set of signatures or samples
of the segment that may be used to represent the segment. Thus,
the set of signatures or samples enable a segment to be described
across a range of resolutions, from lossless reproduction of the
image from the signature down to a lossy reproduction of the image
sample at the lowest resolution. Initially the SCAN program
constructs segment descriptions that hold data comprising lossy
images across the full range of possible resolutions. Plainly,
however, keeping all samples for each signature would consume
large amounts of memory by creating a checkpoint larger than the
file that it represents. The SCAN program 10 therefore also
creates a Segments Description Structure in which signature and
lossy sample data is removed from the segment descriptions of the
base file. That is to say, the Segments Description Structure
enables as much of that data to be removed as possible.

[0032] Upon initial construction, segment descriptions
are placed into a Segments Description Structure
that uniquely distinguishes each using a minimum of
resolution, thereby enabling redundant data to be deleted.
Figure 7 of the accompanying drawings illustrates
how the Segments Description Structure is created by
stage 38 of the SCAN program 10. Reference will be
made first to Figure 7(a), which shows a binary tree of
segment descriptions 21SD to 26SD at the lowest level
of resolution (Resolution 1). The tree is constructed
starting with segment 21 (see Figure 3) by entering the
most lossey sample (e.g. sample 56 in Figure 6) as the
segment description 21SD at the lowest level of resolution
(Resolution 1), represented by plane 60 in Figure 7(a).
Next, the most lossey sample of segment 22 (see
Figure 3) is compared with the segment description
21SD. Any suitable comparison that gives a "less than",
"equal to" or "greater than" result may be used. Thus, a
simple comparison of the numerical values represented
by the data would be sufficient. Of course, more complex
comparisons may be used if desired. In Figure 7(a)
the value of the segment description 22SD is less than
that of the segment description 21SD. This is represented
in Figure 7(a) by the segment description 22SD being
placed to the left of the segment description 21SD. The
value of the segment description 23SD is greater than
that of 21SD, and 23SD is therefore represented as being
placed to the right of 21SD. Continuing through the tree
in Figure 7(a), the value of 24SD is less than that of 21SD
and 22SD, and 24SD is placed in the tree to the left of and
below 22SD. The value of 25SD is greater than that of 21SD
and 23SD, and 25SD is placed to the right of and below 23SD.
The value of 26SD is less than that of 21SD but greater
than that of 22SD, and 26SD is therefore placed to the
right of and below 22SD. Thus, the first six segments 21 to
26 of the image are adequately defined by a tree of segment
descriptions 21SD to 26SD at the lowest level of
resolution.

[0033] However, when the SCAN program 10 reaches
the segment description 27SD, it is found to be equal to
that of 25SD. It is thus not possible to define segment 27
distinctly at the lower level of resolution. Instead, therefore,
the lossey sample at the next highest level of resolution
(Resolution 2) is selected to represent the segment
27. This is represented in Figure 7(b) by the segment
description 27SD being placed in plane 61.
Similarly for segment 28, the segment description 28SD
is greater than 21SD and 23SD, is equal to 25SD, and is
greater than 27SD. The description 28SD is therefore
represented in Figure 7(b) as being placed in plane 61
below and to the right of 27SD. Segment 29 is adequately
defined by the most lossey sample (at Resolution 1). The
segment description 29SD is greater than 21SD and 23SD
and less than 25SD. The segment description 29SD is
therefore represented in Figure 7(b) as being placed in
plane 60 below and to the left of 25SD. The SCAN program
10 continues to add segment descriptions to the Segments
Description Structure in the manner described until
descriptions for all segments of the image have been
placed in the structure. Figure 8 of the accompanying
drawings is a flow chart showing in greater detail how
the process represented by block 38 in Figure 5 and described
with reference to Figure 7 enters a segment description
into the Segments Description Structure. Resolution
can range from the lowest lossey resolution up
to the lossless (signature based) resolution. First, in step
80 the current resolution is set to the lowest lossey level
and in step 82 the current tree is set to the single root
tree that is sorted on that resolution (which will be empty
upon insertion of the first Segment description). Next, in
step 84, an image description of the segment of bytes
at the current working offset is compiled at the current
level of resolution. The working offset WO is simply the
segment index which identifies where the segment appears
in the segment sequence. An attempt is made in
step 86 to match it against a Segment description in the
current tree. If a match is not found, as represented by
decision 88, and therefore a unique place in the tree is
available, a check is made at decision 90 as to whether
the current resolution is lossless. If the current resolution
is lossless, then in step 92 a new Segment description
is created using the current segment's signature, which
is entered into the tree, and the process returns.
[0034] If, on the other hand, the current resolution is
lossey, then in step 94 the current segment's signature
is first calculated. Then, in addition to the lossey image
description already calculated for the current resolution
in step 84, lossey image descriptions are compiled for
any lossey resolutions above the current resolution. The
combined lossey image descriptions and the signature
are then aggregated to create a new Segment description
that is duly entered into the current tree at step
96 and the process returns.

[0035] If a match had been found in the current tree
at decision 88, a check is made at decision 98 as to
whether the current resolution is lossless. If the current
resolution is lossless then a Segment description describing
the current segment's byte pattern (albeit in a
different segment) is already present and the process
returns. If, on the other hand, the current resolution is
lossey, then more resolution exists to elicit a difference
and a check is made at decision 100 as to whether the
matching Segment description already exists on a higher
plane of resolution.

[0036] If this is the case, then at step 104 the current
tree is set to the higher level tree of which the matching
Segment description is root and at step 106 the current
level of resolution is incremented. The method then returns
to step 84 to compile an image description at the
current resolution. Alternatively, if at decision 100 a
higher tree is not available, then at step 102 the image
description contained within the matching Segment description
that was compiled at the resolution one above
the current resolution is taken and used to form the root
of a tree sorted upon that resolution. The process then
continues at step 104 as previously described.
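The insertion procedure of Figure 8 might be sketched as below. This is a simplified illustration under stated assumptions, not the patented implementation: the sample lengths in `SAMPLE_LENS`, the leading-bytes sampling in `image`, and the names `Node`, `links` and `insert` are all hypothetical. A 32-bit CRC from Python's `zlib` stands in for the signature.

```python
import zlib

SAMPLE_LENS = [1, 3, 6]   # bytes per lossey level (assumed values)
TOP = len(SAMPLE_LENS)    # index of the lossless, signature-based level


def image(segment: bytes, level: int) -> bytes:
    """Image description at `level` (assumed sampling: leading bytes)."""
    if level == TOP:
        return zlib.crc32(segment).to_bytes(4, "big")   # 32-bit CRC signature
    return segment[:SAMPLE_LENS[level]]


class Node:
    def __init__(self, index: int, segment: bytes, level: int):
        self.index = index
        # Steps 92/94: keep images from the entry level upward, plus the signature.
        self.images = {lv: image(segment, lv) for lv in range(level, TOP + 1)}
        # One [left, right] pair per tree (resolution) this node belongs to.
        self.links = {level: [None, None]}


def insert(root, index: int, segment: bytes):
    """Enter a segment into the structure; returns the (possibly new) root."""
    if root is None:
        return Node(index, segment, 0)
    level, node = 0, root                          # steps 80-82
    while True:
        key = image(segment, level)                # step 84 (recomputed for brevity)
        other = node.images[level]
        if key == other:                           # decision 88: match found
            if level == TOP:                       # decision 98: pattern already recorded
                return root
            # Steps 100-104: the matching node roots the next-resolution tree.
            node.links.setdefault(level + 1, [None, None])
            level += 1                             # step 106
            continue
        side = 0 if key < other else 1             # less than: left, greater than: right
        child = node.links[level][side]
        if child is None:                          # unique place in the tree
            node.links[level][side] = Node(index, segment, level)
            return root
        node = child


root = None
for i, seg in enumerate([b"AAAAAAAA", b"ABBBBBBB", b"AAAAAAAA", b"BAAAAAAA"]):
    root = insert(root, i, seg)
```

In this run the second segment collides with the first at the lowest resolution and is placed one level up, the third is an exact duplicate and is discarded, and the fourth is distinguished at the lowest resolution.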
[0037] Figure 9 of the accompanying drawings shows
an example of a Segments Description Structure comprising
a multi-dimensional hierarchy of binary trees of
segment descriptions ordered by, and successively
sorted on, successive levels of resolution 60, 62, 64, 66.
In this example, all hierarchies share the same root 68,
which is the root of the single tree that is sorted upon
the bytes used to construct an image at the lowest level
of resolution 60. The trees, which range from the single
root tree sorted upon the lowest resolution 60 to trees
sorted upon the signature 66, never contain segment
descriptions with matching values. Instead, when a segment
description is entered into the lowest resolution
tree and a matching description is already in place, the
matching description is taken to represent the root of a
tree at the next level of resolution and the new segment
description is thus entered into that tree. Segment descriptions
can thereby be simultaneously present in successive
planes of resolution, as shown by nodes SDx1
and SDx2. The process continues recursively until either
a unique place within a tree is found, or a matching description
is found in a tree sorted on a lossless level of
resolution 66. In the latter case, the current segment's
pattern of bytes has already been recorded and the segment
description is simply discarded.
[0038] A feature of the Segments Description Structure
is that for each segment description placed in a
resolution n sorted tree, the bytes that compose its lossey
images at resolutions 1 to (n-1) are described by the
root descriptions of trees higher in its tree hierarchy. This
enables implementers to achieve memory space economies
while constructing a structure, and then, after it
is completed, when it is stored or compacted.
[0039] It will be appreciated from the foregoing description
that during the construction of the Segments
Description Structure a segment description may need to
be made the root of a higher resolution tree upon insertion
of another description. As this can happen repeatedly
until a signature sorted tree is reached, the bytes
needed to represent higher resolutions must be held by
each description until the structure is completed. But no
such restriction holds for the bytes representing image
resolutions below that of the tree in which the description
is initially placed, as these are implied by root descriptions
of trees higher in the hierarchy. Thus during construction,
these bytes may be omitted from descriptions
to achieve a significant space economy. However, a
much greater and more significant economy can be
achieved once construction of a Segments Description
Structure is completed. At this point insertion into the
structure has ceased and segment descriptions no longer
need to hold the bytes needed to make them the root
of a new higher resolution tree. Therefore a new representation
of the structure can be created, in which descriptions
hold only the increment of bytes that distinguish
the resolutions of the different trees in which they
are present. Moreover, because the great majority of descriptions
appear only in one tree, even with unfavorable
files, the total space required to represent a completed
Segments Description Structure is generally not much
greater than that which would be required if only a single
sampling resolution had been used.
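The idea of holding only an increment of bytes per resolution can be illustrated with a short sketch. It assumes, purely for illustration, that each lossey sample is a byte-for-byte prefix of the sample at the next resolution; the function name `increments` and the sample values are hypothetical.

```python
def increments(samples: list[bytes]) -> list[bytes]:
    """Split nested samples into the bytes each level adds to the one below.

    Assumes samples are ordered from lowest to highest resolution and that
    each one is a prefix of the next.
    """
    out, prev_len = [], 0
    for sample in samples:
        out.append(sample[prev_len:])   # only the bytes new at this resolution
        prev_len = len(sample)
    return out


samples = [b"\x00", b"\x00\x02\x05", b"\x00\x02\x05\x08\x0a\x0d"]
incs = increments(samples)
# Any level can be rebuilt by joining the increments up to and including it.
rebuilt_level_2 = b"".join(incs[:2])
```

Under this assumption the increments together occupy no more space than the single highest-resolution sample, which is consistent with the space economy described above.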
[0040] The present invention is not limited to the
aforementioned means of hierarchical representation of
segment data, nor the aforementioned data sampling
schemes, nor the aforementioned method for construction
of lossey images. For example, it is conceivable that
the trees within the aforementioned Segments Description
Structure might be replaced by a hashing scheme.
Similarly, not only are many schemes for sampling image
bytes possible, but a range of lossey images could
be defined as constructed from a mathematical representation
of sampled bytes as opposed to their simple
aggregate. It is the utilization of hierarchical storage of
lossey representations of base file segments derived to
hold varying degrees of information up to some maximum
level that enables efficient representation of a file.
[0041] Once the SCAN program 10 has completed
the Segments Description Structure and returned, either
the SAVE program 12 or the COMPACT program 18 can
be run. The COMPACT program 18 simply reduces the
amount of onboard memory the Segments Description
Structure consumes by removing the aforementioned
redundant data therefrom. The SAVE program 12, on
the other hand, writes a compacted version of the Segments
Description Structure to non-volatile storage,
such as a hard disk, to generate a checkpoint for the
provided base file.

[0042] With reference now to the table in Figure 10 of
the accompanying drawings, the four types of checkpoint-stored
Segment Descriptions (SDs) used by the
SAVE program 12 will be described. The figures in the
Lossey Image Bytes and Signature Bytes columns are
derived from the SCAN program 10 shown in Figure 4
and the use of 32-bit CRC signatures respectively. Node
types L, H and S can be directly mapped onto Segment
Descriptions (SDs) existing in the Segments Description
Structure. The last node type, X, makes it possible to
describe Segments Description Structures as a sequential
list of SD Nodes such that it may be reconstructed
using a macro-like expansion, as will be described hereinbelow
with reference to Figure 11.

[0043] As the aforementioned sequential list of SD
nodes is comprised of node types requiring from zero to
six bytes of storage, the individual nodes cannot later
be extracted by reading equal consecutive blocks of data.
Therefore, to make their later decoding possible, an
array of their binary type codes is also written. As two
bits are needed to specify each type code, the size of
the array in bytes is equal to the number of sequential
segments, plus the number of trees in the Segments
Description Structure, divided by four, rounded up - a
relatively small quantity.
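The size calculation above, and one way of packing two-bit type codes four to a byte, can be sketched as follows. The numeric codes assigned to the node types, the bit order within each byte, and the function names are assumptions made for illustration; the description specifies only that two bits are needed per code.

```python
import math


def type_array_bytes(num_segments: int, num_trees: int) -> int:
    """Bytes needed for one 2-bit type code per sequential segment and per tree."""
    return math.ceil((num_segments + num_trees) / 4)


def pack_type_codes(codes: list[int]) -> bytes:
    """Pack 2-bit codes four to a byte, lowest bits first (assumed layout)."""
    out = bytearray(math.ceil(len(codes) / 4))
    for i, code in enumerate(codes):
        out[i // 4] |= (code & 0b11) << (2 * (i % 4))
    return bytes(out)


def unpack_type_codes(data: bytes, count: int) -> list[int]:
    """Recover `count` 2-bit codes written by pack_type_codes."""
    return [(data[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


# Hypothetical figures: a 1 MiB base file in 512-byte segments, with 40 trees.
size = type_array_bytes(2048, 40)
packed = pack_type_codes([0, 1, 2, 3, 1])
```

For the hypothetical figures used here the array occupies 522 bytes, which is small relative to the 1 MiB base file it helps describe.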

[0044] Figure 11 of the accompanying drawings illustrates
how a sequential list 100 of Segment Description
(SD) nodes is created by the SAVE program 12 from the
binary trees in plural planes of resolution 110, 112, 114
and 116 produced by the SCAN program 10. It can be
seen from Figure 11 that the existence of H type nodes
in signature sorted trees can be inferred from the array
context, thus enabling them to be omitted from the array.
These H nodes are also exceptional because, as they
do not describe an increment between lossey resolutions,
they require no storage in the list of SD Nodes.
Also, as the order of SD Nodes bears no relation to the
sequence index of the corresponding sequential segment,
an array providing the sequence indexes of L and
S type nodes must also be written.
[0045] The calculation of how to transform the base
file into the updated file is performed by the MATCH program
16. The MATCH program 16 requires both the updated
file and the corresponding base file's loaded Segments
Description Structure as parameters. If the Segments
Description Structure is not already in memory
then the LOAD program 14 can be called to load it
from the corresponding base file's checkpoint. The MATCH
program 16 then generates a Morph List describing
the updated file as a list of new bytes and sequential
segments from the base file (identified by their index
numbers). The operations performed by the MATCH
program 16 are illustrated in Figure 12 of the accompanying
drawings. As shown therein, the MATCH program
16 begins at step 120 by setting its working offset in the
updated file to zero (the first byte). The MATCH program
16 then moves incrementally through the provided
updated file. In step 122 the MATCH program 16
at each offset scans the Segments Description Structure
for a segment with an image description matching
the following segment of bytes in the updated file. If
there is no match, then decision 124 determines that a
byte unique to the provided file has been found. The
method then proceeds to step 126 where the byte at the
current working offset is appended to the Morph List.
The working offset is then incremented by one at step
128. If, alternatively, a match is determined at decision
124, then a segment has been found in the updated file
that exists in the base file. The segment identifier is then
appended to the Morph List at step 130 and the working
offset is incremented by the standard segment length at
step 132. This process continues until at decision 134 it
is determined that the working offset extends beyond
the end of the file. At this point a description of the updated
file has been created in terms of unique bytes and
segments from the base file.
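The loop of Figure 12 can be sketched in a few lines. To keep the example self-contained, an exact-match dictionary stands in for the Segments Description Structure; that substitution, the four-byte segment length, and the names `build_morph_list` and `apply_morph` are assumptions for illustration. Morph List entries are represented here as an `int` for a base-file segment index and as `bytes` for a literal byte.

```python
def build_morph_list(updated: bytes, find_segment, seg_len: int) -> list:
    """Describe `updated` as base-segment indexes and literal bytes."""
    morph, offset = [], 0                                  # step 120
    while offset < len(updated):                           # decision 134
        index = find_segment(updated[offset:offset + seg_len])   # step 122
        if index is None:                                  # decision 124: no match
            morph.append(updated[offset:offset + 1])       # step 126: unique byte
            offset += 1                                    # step 128
        else:
            morph.append(index)                            # step 130
            offset += seg_len                              # step 132
    return morph


def apply_morph(base: bytes, morph: list, seg_len: int) -> bytes:
    """Rebuild the updated file from the base file and a Morph List."""
    out = bytearray()
    for item in morph:
        if isinstance(item, int):
            out += base[item * seg_len:(item + 1) * seg_len]
        else:
            out += item
    return bytes(out)


SEG_LEN = 4                                   # illustrative; the text suggests 512
base = b"AAAABBBBCCCC"
# Stand-in lookup: exact segment bytes mapped to segment index.
table = {base[i:i + SEG_LEN]: i // SEG_LEN for i in range(0, len(base), SEG_LEN)}
updated = b"AAAAxBBBBCCCC"
morph = build_morph_list(updated, table.get, SEG_LEN)
```

Here the single inserted byte is carried as a literal while the three unchanged segments are carried as indexes, and applying the list to the base file reproduces the updated file.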

[0046] The procedure by which the MATCH program
16 determines in step 122 whether the Segments Description
Structure contains a matching description of a
segment is illustrated in Figure 13 of the accompanying
drawings. Resolution can range from the lowest lossey
resolution up to the lossless (signature based) resolution.
The procedure 122 first sets the current resolution
to the lowest lossey level at step 140 and then selects
the single root tree that is sorted upon it at step 142.
Next, at step 144, a description of the current segment's
image is derived at the current resolution (which can be
either lossey or lossless). At step 146 an attempt is
made to find a Segment description in the current tree
that contains a matching image description.
[0047] If a match cannot be found at decision 148,
then the Segments Description Structure cannot hold
a matching image description. So at step 150 the procedure
returns that no match has been found. If, alternatively,
there is a match, at decision 152 a check is
made as to whether the current resolution (R) is lossless.
If the current resolution is lossless, then the segment's
signature matches the signature held in the Segment
description, and at step 154 SD match is set and
a match is returned.

[0048] If, on the other hand, it is determined at decision
152 that the current resolution is lossey, then a
check is made at decision 156 as to whether the matching
Segment Description is also the root of a tree sorted
upon a higher resolution. If this is the case, then the
higher tree is entered at step 162, the current resolution
is set to be one level higher at step 164, and the procedure
returns to step 144 to derive a new description of
the segment's image at the new resolution.

[0049] If it is determined at decision 156 that no higher
resolution tree is available, then it is no longer possible
to put off calculating a signature for the segment. The
signature of the segment's bytes is therefore calculated
and compared to the signature held in the Segment Description
at step 158. If the signatures are found to be
the same at decision 160, the method returns the Segment
Description as a match at step 154. If the signatures
are not the same, the method returns that no match
has been found at step 150.
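The lookup of Figure 13, including the deferred signature calculation, might be sketched as follows. Dictionaries are used in place of binary trees (the description itself notes that a hashing scheme could replace the trees), and `add` is a simplified builder included only so that the example runs. The sample lengths, the leading-bytes sampling, and all names (`Entry`, `higher`, `add`, `find`) are assumptions for illustration.

```python
import zlib

SAMPLE_LENS = [1, 3, 6]      # bytes per lossey level (assumed values)
TOP = len(SAMPLE_LENS)       # index of the lossless, signature-based level


def signature(segment: bytes) -> bytes:
    return zlib.crc32(segment).to_bytes(4, "big")     # 32-bit CRC signature


def image(segment: bytes, level: int) -> bytes:
    """Image description at `level` (assumed sampling: leading bytes)."""
    return signature(segment) if level == TOP else segment[:SAMPLE_LENS[level]]


class Entry:
    def __init__(self, index: int, segment: bytes):
        self.index = index
        self.images = [image(segment, lv) for lv in range(TOP + 1)]
        self.higher = {}     # level -> dict "tree" at that level rooted at this entry


def add(tree: dict, index: int, segment: bytes) -> None:
    """Simplified builder: dicts stand in for the binary trees."""
    new = Entry(index, segment)
    for level in range(TOP + 1):
        hit = tree.get(new.images[level])
        if hit is None:
            tree[new.images[level]] = new
            return
        if level == TOP:
            return                                    # byte pattern already recorded
        tree = hit.higher.setdefault(level + 1, {hit.images[level + 1]: hit})


def find(tree: dict, segment: bytes):
    """Return the matching base-segment index, or None."""
    for level in range(TOP + 1):                      # steps 140-142, then 164
        hit = tree.get(image(segment, level))         # steps 144-146
        if hit is None:
            return None                               # step 150: no signature needed
        if level == TOP:
            return hit.index                          # decision 152, step 154
        nxt = hit.higher.get(level + 1)               # decision 156
        if nxt is None:                               # steps 158-160: signature now due
            return hit.index if signature(segment) == hit.images[TOP] else None
        tree = nxt                                    # step 162


tree = {}
for i, seg in enumerate([b"AAAAAAAA", b"ABBBBBBB", b"BAAAAAAA"]):
    add(tree, i, seg)
```

A candidate such as `b"ACCCCCCC"` is rejected at the second resolution without any signature being computed, which is the saving the hierarchy is intended to provide.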

[0050] Those possessed of the appropriate skills will
appreciate from the foregoing description of the SCAN
program 10 that if Segments Description Structures are
constructed using a large sampling resolution, a relatively
small penalty is paid in terms of the space consumed.
It is also the case that if an image sampled for
a segment recurs, then the segment will most likely
eventually be described by some higher resolution image.
Thus on the one hand the SCAN program 10 greatly
increases the overall sampling resolution relative to
the size of the checkpoint, and on the other hand it concentrates
higher resolution sampling on describing segments
whose lower resolution images occur most commonly.

[0051] Consequently the performance characteristics
of the MATCH program are improved. The increased
overall sampling resolution improves its overall performance
and the adaptive concentration of resolution reduces
the performance degrading effect of recurring
patterns within the represented base file. These advantages
are delivered by the novel hierarchical representation
of data sampled from segments across a range
of resolutions as shown in and described with reference
to Figures 7 and 9 of the accompanying drawings for
example.

[0052] The above described programs are faster than
those hitherto known because on average fewer signature
calculations are made when determining whether
updated file segments match base file segments described
in the checkpoint. This is partly because, given
some arbitrary checkpoint size, relatively more lossey
description bytes can be stored for each base file segment
and these can later be used to disqualify updated
file segments as possible matches without resorting to
signature calculation. The above-described hierarchical
storage of segment descriptions enables more lossey
bytes to be stored. Therefore it is possible that some
other type of search structure could replace the binary trees
in the data structure hierarchies. In such a case the
aforementioned algorithms would not change, but references
to trees would be replaced by the new structure.
[0053] Having thus described the present invention by
reference to a preferred embodiment, it is to be well
understood that the embodiment in question is exemplary
only and that modifications and variations such as will
occur to those possessed of appropriate knowledge and
skills may be made without departure from the spirit and
scope of the invention as set forth in the appended
claims and equivalents thereof.



Claims

1. A method of producing a checkpoint which describes
a base file, the method comprising:

dividing the base file into a series of segments;
generating for each segment a segment description; and
creating from the generated segment descriptions
a segments description structure as the
checkpoint.

2. A method as claimed in claim 1, wherein all seg- 
ments are of equal, predetermined length. 

3. A method as claimed in claim 1, wherein each segment
description comprises a lossless signature
and a plurality of lossey samples each describing
the segment at a different level of resolution.

4. A method as claimed in claim 3, wherein the plurality
of lossey samples comprises a first lossey sample
containing data selected from the segment at
a first level of resolution and a second lossey sample
containing data selected from the segment at a
second, lower level of resolution, and the data selected
for the second lossey sample is a subset of
the data selected for the first lossey sample.

5. A method as claimed in claim 1, wherein each segment
description comprises a segment index defining
the position of the segment in the series.

6. A method as claimed in claim 3, wherein the seg- 
ments description structure is created by selecting 
for each segment from among the plural lossey 
samples and the lossless signature a description 
that adequately distinguishes the segment to the 
lowest level of resolution. 

7. A method as claimed in claim 6, wherein the description
is selected for each segment by comparing
the plural lossey samples and the signature with respective
plural lossey samples and signatures of
other segments earlier in the sequence starting with
the lossey samples at the lower level of resolution.

8. A method as claimed in claim 6, wherein the segments
description structure comprises binary tree
structures created by comparing a characteristic of
a segment description of one segment with the
characteristic of a segment description of another,
previously entered segment, and determining
whether the characteristic is greater than, less than,
or equal to the characteristic of the other segment.

9. A method as claimed in claim 8, wherein the segment
description of the one segment is entered into
a binary tree at the same level of resolution as the
segment description of the other segment if the
characteristic of the one segment is less than or
greater than the characteristic of the other segment,
and the segment description of the one segment is
entered into a binary tree at a higher level of resolution
than the segment description of the other segment
if the characteristic of the one segment is
equal to the characteristic of the other segment.

10. A method as claimed in claim 6, wherein the segments
description structure is created by entering
the lossless signature of a segment only if none of
the plural lossey samples adequately distinguish
the segment.

11. A method as claimed in claim 6, wherein, once a
segment description has been placed in the segments
description structure, redundant data is removed
from the segment description in order to reduce
the amount of data in the segments description
structure.

12. A method as claimed in claim 11, wherein the redundant
data comprises lossey samples at a resolution
greater than that at which the segment description
is entered into a binary tree.

13. A method as claimed in claim 11, wherein the redundant
data comprises information derivable from
lossey samples entered into binary trees at resolutions
lower than that at which the segment description
is entered into a binary tree.

14. A method of producing a morph list that defines an
updated version of a base file with reference to the
base file and a checkpoint for the base file, which
checkpoint is produced according to any preceding
claim, the method comprising:

defining a first segment at a start position in the
updated file;

generating a segment description for the first
segment;

comparing the segment description for the first
segment with segment descriptions of the
checkpoint; and

if a match is found, adding the matched segment
description to the morph list and, if no
match is found, adding data in the first segment
to the morph list.

15. A method as claimed in claim 14, further comprising
defining a second segment, and wherein the second
segment is defined at a position adjacent to the
first segment if a match is found and the second
segment is defined at a position overlapping but not
including the data added to the morph list if no
match is found.

16. A method as claimed in claim 14, wherein the data 
in the first segment added to the morph list compris- 
es a first byte in the segment. 

17. A method as claimed in claim 14, wherein the segment
description for the first segment is compared
with segment descriptions of the checkpoint by
comparing samples of the first segment with samples
in the checkpoint starting with the lossey samples
at the lower level of resolution.

18. A method as claimed in claim 17, wherein, when a
match is found, samples are compared at increasing
levels of resolution to identify matching samples
until no further match is found, and then the matching
samples are compared with the lossless signature.


19. A method of generating a difference file defining differences
between an updated file and a base file,
the method comprising:

generating a checkpoint defining characteristics
of the base file in terms of multiple segment
descriptions each selected to represent a respective
segment of the base file at a minimum
level of resolution sufficient to represent distinctly
the segment;

generating at different levels of resolution segment
descriptions for segments in the updated
file and comparing the generated segment descriptions
with segment descriptions in the
checkpoint to identify matching and non-matching
segments; and

storing as the difference file data identifying
segments in the updated file that match segments
in the base file and data representing
portions of the updated file at a minimum level
of resolution sufficient to represent distinctly the
portion.